fs.http: prevent hangs under some network conditions #7460
Conversation
@dtrifiro Is it possible to show the difference before and after? Possibly showing the output from each, a video, or a reproducible script?
@@ -114,7 +120,7 @@ async def get_client(self, **kwargs):
             total=None,
             connect=self.REQUEST_TIMEOUT,
             sock_connect=self.REQUEST_TIMEOUT,
-            sock_read=None,
+            sock_read=self.REQUEST_TIMEOUT,
I'm assuming that setting `sock_read=self.REQUEST_TIMEOUT` (60s) will not create any issues: the default buffer size is 2**16, see `ClientSession._request`: https://github.com/aio-libs/aiohttp/blob/f5ff95efe278c470a2ff65cabbb5f5f08ba07416/aiohttp/client.py#L511-L519
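For anyone following along, here's a minimal sketch of what the resulting timeout configuration looks like (illustrative, not the exact dvc code; `REQUEST_TIMEOUT = 60` is assumed from the discussion above):

```python
import aiohttp

REQUEST_TIMEOUT = 60  # seconds, assumed from the discussion above


async def get_session() -> aiohttp.ClientSession:
    timeout = aiohttp.ClientTimeout(
        total=None,                    # no overall deadline; transfers can be large
        connect=REQUEST_TIMEOUT,       # max time to acquire a connection from the pool
        sock_connect=REQUEST_TIMEOUT,  # max time to establish the TCP connection
        sock_read=REQUEST_TIMEOUT,     # max gap between socket reads (the new setting)
    )
    return aiohttp.ClientSession(timeout=timeout)
```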
For the record, here are a few examples of failing CI (timeout) on
Here are a few examples of succeeding CI (running this branch):
Amazing research!
+1, amazing work 🔥
dvc/fs/http.py (Outdated)
client_kwargs["connector"] = aiohttp.TCPConnector(
    enable_cleanup_closed=True
)
Is there more detail on this? I'd like to see why this is not the default in `aiohttp`.
Looks like aiohttp used to do this by default 5 years back, but it was changed into an option by this commit: aio-libs/aiohttp@02b0951. I could not find any reasoning for the change, though.
From https://docs.aiohttp.org/en/stable/client_reference.html#tcpconnector:
Some ssl servers do not properly complete SSL shutdown process, in that case asyncio leaks SSL connections. If this parameter is set to True, aiohttp additionally aborts underlining transport after 2 seconds. It is off by default.
Given that description of how it works, it seems reasonable that this isn't the default.
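To make the quoted behavior concrete, here's an illustrative sketch of the mechanism as I read it from the docs (my own reading, not aiohttp's actual internals): transports whose SSL shutdown never completes get forcefully aborted after the 2-second grace period.

```python
# Illustrative only: my reading of the docs above, not aiohttp's real internals.
import asyncio


async def abort_stuck_transports(pending: list[asyncio.Transport]) -> None:
    """Forcefully abort transports whose SSL shutdown never completed."""
    await asyncio.sleep(2)  # the 2-second grace period mentioned in the docs
    for transport in pending:
        transport.abort()  # close immediately, skipping the SSL shutdown handshake
```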
I, too, was wondering why this is not the default, but digging around I could not find any reasoning either. I guess this only becomes an issue under particular network conditions (broken TLS connections?), which were encountered often in dvc-bench due to our large number of concurrent requests.
It looks like this could be the context for the change: aio-libs/aiohttp#1767.
Edit to add some more explanation: the discussion is right around the time of the change, the contributor is in the discussion, and they discuss the related (now outdated) attribute `cleanup_closed_disabled`.
I guess we could also ask 😄
Force-pushed 1745f16 to 12de7e7
Added some context comments with the last force push
@@ -114,7 +123,7 @@ async def get_client(self, **kwargs):
             total=None,
             connect=self.REQUEST_TIMEOUT,
             sock_connect=self.REQUEST_TIMEOUT,
-            sock_read=None,
+            sock_read=self.REQUEST_TIMEOUT,
I am curious to see if this alone will solve the problem in dvc-bench, and a bit hesitant to add `TCPConnector` without trying out this solution first. Would you mind opening another PR with just this line change? We can merge that and keep this PR open for 1-2 days to monitor.
I've tried it; it solves the issues on my machine ™️ (macOS), but it still freezes on the dvc-bench CI.
Gonna test out a few extra scenarios before merging this.
We have been getting reports that the timeout on sock_read was raising timeout errors even for chunked uploads, and sometimes even when uploading zero-byte files. See: https://github.com/iterative/dvc/issues/8065 and iterative/dvc#8100. This kind of logic doesn't belong here and should be upstreamed (e.g. RetryClient/ClientTimeout, etc.). We added the timeout in iterative/dvc#7460 because of freezes in iterative/dvc#7414. I think we can roll this back for now, given that there are lots of reports of failures/issues with this line; if we get any new reports of hangs, we'll investigate them separately.
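A sketch of what the rollback amounts to (mirroring the diff earlier in this thread, not the exact patch; `REQUEST_TIMEOUT` is an assumed constant):

```python
import aiohttp

REQUEST_TIMEOUT = 60  # assumed, as in the original PR

# Rolled back: sock_read=None again, since the 60s read timeout was
# reported to fire on slow chunked uploads and even zero-byte files.
timeout = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=None,
)
```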
When using an https remote (such as in the dvc-bench remote, see iterative/dvc-bench#319), under certain situations `dvc pull` would freeze and never return. Note that this did not happen 100% of the time, but often enough to make the bench CI always fail (note: CI was running a matrix strategy over ~12 workers, pulling a 25k-file dataset on each).

Investigation pointed to network issues (as the issue could not be reproduced locally). I ended up tracking this down to a connection being dropped/lost and `aiohttp` not realizing this. Due to our timeout defaults for `aiohttp.ClientSession` (note that `sock_read=None`), `aiohttp` would keep trying to read from a dead socket and thus hang.

By instantiating an `aiohttp.TCPConnector` with `enable_cleanup_closed=True`, the underlying transport is forcefully closed when the connection is lost, so that the request fails and `RetryClient` retries.

Thanks @pared and @karajan1001
fixes #7414
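For reference, a minimal sketch of the overall client setup described above (assuming `aiohttp_retry`'s `RetryClient`, which the description names; the retry parameters are illustrative, not dvc's actual values):

```python
import aiohttp
from aiohttp_retry import ExponentialRetry, RetryClient

REQUEST_TIMEOUT = 60  # illustrative value

# Forcefully abort transports when a connection is lost mid-SSL-shutdown,
# so a dead connection surfaces as an error instead of a silent hang.
connector = aiohttp.TCPConnector(enable_cleanup_closed=True)

timeout = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=REQUEST_TIMEOUT,
)

# RetryClient wraps aiohttp.ClientSession: once the aborted transport makes
# the request fail, it is retried instead of freezing `dvc pull`.
client = RetryClient(
    connector=connector,
    timeout=timeout,
    retry_options=ExponentialRetry(attempts=5),
)
```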